This is the first assignment of the University of Helsinki MOOC.
My Git repository is at the following address: https://github.com/toparvia/IODS-project.git
This is the first entry in the learning diary. The first lecture introduced basic methods of collaboration, such as using Git for your code and writing reports with Markdown and knitr. I have used these before, but I'm more familiar with using them on UNIX than on Windows machines. I felt quite clumsy with these tools, as in my own PhD project I don't need to collaborate on the methods part as much. I chose to use GitHub Desktop and RStudio, as recommended in the course materials.
I also went through the R Short and Sweet course. I think the only trouble was with the last assignment: my function took a different approach, with the arguments in a different order, and the test didn't accept it at first. Eventually I arrived at a solution that satisfied the test. I'm curious to see how the course will develop.
Below are a few tests of R Markdown features I tried:
inline equation: \(A = \pi*r^{2}\)
| Table Header | Second Header |
|---|---|
| Table Cell | Cell 2 |
| Cell 3 | Cell 4 |
library(GGally) # ggpairs() comes from GGally
library(ggplot2)
p <-ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# Data exploration of the learning2014 with ggpairs
I found this code quite useful for getting an overview of the data. I have previously used simpler plotting tools for initial analysis, such as separate correlation plots or histograms, but I might use this one again since it provides insight in a single figure.
I looked into the data provided for the exercise. The data contain a large number of different variables, point scores and some categorical variables. I used the common commands for a simple overview.
require(readr)
PKY <- read_delim("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
#working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #works in Rstudio only
dim(PKY) #dimensions of the data
str(PKY) #structure of the data
summary(PKY) #summary of the data
I also used some other methods for data exploration, but I'll write about those at the end of this exercise, together with the data analysis.
library(data.table)
#Here we make each of the needed variables for practice
gender <- data.table(PKY$gender)
age <- data.table(PKY$Age)
attitude <- data.table(PKY)[, .(Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)]
attitude <- attitude/10
deep <- data.table(PKY)[, .(D03+D11+D19+D27+D07+D14+D22+D30+D06+D15+D23+D31)]
deep <- deep/12
surf <- data.table(PKY)[, .(SU02+SU10+SU18+SU26+SU05+SU13+SU21+SU29+SU08+SU16+SU24+SU32)]
surf <- surf/12
stra <- data.table(PKY)[, .(ST01+ST09+ST17+ST25+ST04+ST12+ST20+ST28)]
stra <- stra/8
points <- data.table(PKY$Points)
#Combining variables to one data.table
learn2014 <- data.table(cbind(gender, age, attitude, deep, stra, surf, points))
#giving column names for the data.table
colnames(learn2014) <- c("gender", "age", "attitude", "deep", "stra", "surf", "points")
#removing the observations where points is zero
learn2014 <- learn2014[points >0]
#writing the file to the data folder
write.csv(learn2014, file = "data/learn2014.csv")
#reading the table
learn2014_reloaded <- read.csv(file = "data/learn2014.csv")
#showing the data summary with reloaded data
summary(learn2014_reloaded)
I produced the required data with the data.table package, which I had used only briefly before, so I thought this was a good place to practice. I could probably have written this as a single command, but I felt more confident doing it one step at a time. Next we start the analysis of the data.
# Data analysis
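Out of curiosity, here is a hedged sketch of what a one-command version of the wrangling above could look like with dplyr (a toy stand-in for PKY with made-up columns, so the chunk is self-contained; the real data would use the full question groups):

```r
library(dplyr)

# toy stand-in for PKY (real data has Da..Dj, D*, SU*, ST* question columns)
PKY_toy <- data.frame(gender = c("F", "M"), Age = c(25, 30),
                      Da = c(3, 4), Db = c(2, 5), Points = c(20, 0))

# the whole wrangling as one pipeline: build variables, then drop zero points
learn_toy <- PKY_toy %>%
  transmute(gender,
            age      = Age,
            attitude = (Da + Db) / 2,  # real data: mean of all attitude items
            points   = Points) %>%
  filter(points > 0)
learn_toy
```

The same chain would scale to the deep/stra/surf sums by adding more `transmute()` lines.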
learn2014 <- read.csv(file = "data/learn2014.csv")
require(GGally)
require(ggplot2)
#This function overlays a linear regression fit and a loess (locally weighted scatterplot smoothing) curve on each variable pair, so non-linear relationships between variables stand out; it is meant for use with the ggplot2 and GGally packages.
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=loess, fill="red", color="red", ...) +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
g = ggpairs(learn2014,columns = c(2:7), lower = list(continuous = my_fn))
g
# Here we can see that the combined variables do not correlate strongly with each other, and that all of them except age and surf show a positive linear relationship with points. Gender is excluded from this figure.
#The distributions look roughly normal. Age is not normally distributed; perhaps it follows a Poisson distribution.
p <-ggpairs(learn2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
#Gender seems to affect attitude the most; however, these differences look rather small.
#I chose the model parameters based on the data exploration I did before.
#Attitude looked promising so it was in all the models.
m1 <- lm(points ~ attitude*deep*stra*surf+age, data=learn2014)
m2 <- lm(points ~ attitude + stra + surf + age + attitude:stra +
attitude:surf + stra:surf + attitude:stra:surf, data = learn2014)
m3 <- lm(points ~ attitude+deep+stra+surf+age, data=learn2014)
m4 <- lm(points ~ attitude + stra + age, data = learn2014)
m5 <- lm(points ~ attitude + stra, data = learn2014)
m6 <- lm(points ~ attitude + stra + gender, data = learn2014)
AIC(m1,m2,m3,m4,m5,m6)
## df AIC
## m1 18 1039.982
## m2 10 1031.755
## m3 7 1029.450
## m4 5 1028.225
## m5 4 1029.038
## m6 5 1030.978
#m4 has the lowest Akaike information criterion, but m5 is within one AIC unit of it and is the most parsimonious, so m5 is chosen.
par(mfrow=c(2,2))
plot(m5)
#The model looks good. The residuals are quite evenly distributed on both sides of zero and there are only a few outliers. The included parameters probably fulfil the normality assumption and form a good predictor combination, since their mutual correlations were low (<0.06).
drop1(m5)
## Single term deletions
##
## Model:
## points ~ attitude + stra
## Df Sum of Sq RSS AIC
## <none> 4559.4 555.95
## attitude 1 1051.89 5611.3 588.41
## stra 1 81.74 4641.1 556.90
#drop1() tests whether dropping any term would improve the model.
#Looks like we already have the most parsimonious model with the lowest AIC.
summary(m5)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
# Attitude (global attitude toward statistics) is highly significant, and stra (organised studying and time management) is borderline with p of about 0.09. The model explains only about 20 % of the variation in the data. Perhaps further data wrangling or better models would be in order.
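A quick reminder sketch of what that "explains 20 %" means: the R-squared reported by summary() is one minus the residual sum of squares over the total sum of squares (toy data here, not the course data):

```r
# R-squared by hand: 1 - RSS/TSS, matching summary(fit)$r.squared
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```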
I did some analysis to find the predictors of the full data, just out of interest. I found a different set of predictors that correlate with the points.
library(caret) #findCorrelation algorithm
library(ggcorrplot) #plot the correlations
#PKY is the full dataset
PKY2 <- PKY[-60] # removing the gender column
correlations <- abs(cor(PKY2, use="pairwise.complete.obs"))
#absolute values of all pairwise correlations
correlations.selected <- findCorrelation(correlations, cutoff = .7, exact = TRUE)
#columns flagged by the 0.7 cutoff
corr <- cor(PKY2[c(correlations.selected,59)])
#correlations of narrow selection and points variable
# Plotting correlations
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of PKY",
ggtheme=theme_bw)
What does this do? It takes all the predictors, finds the correlations in the data above a cutoff, so you can basically see which questions correlate with each other, and then compares them with the variable of interest, in this case the points.
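The idea can be sketched in base R on synthetic data (this is an illustration of the principle, not caret's exact algorithm, which also decides which member of each correlated pair to drop):

```r
# Flag variable pairs whose absolute correlation exceeds a cutoff,
# mimicking the idea behind caret::findCorrelation.
set.seed(1)
x1 <- rnorm(100)
d  <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(100, sd = 0.1),  # nearly a copy of x1
                 x3 = rnorm(100))                 # independent noise
cm <- abs(cor(d))
diag(cm) <- 0                       # ignore self-correlations
high <- which(cm > 0.7, arr.ind = TRUE)
unique(rownames(high))              # x1 and x2 are flagged, x3 is not
```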
One other thing I found handy: since I've worked on this data set with two different computers, I can now easily migrate between them and just use Git to move the data. I also found the following function handy, so that I can use the same script on both computers.
#working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #works in Rstudio only
Basically it sets the working directory to the location of the source file without changing the RStudio settings.
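One caveat: that call errors out when the script runs outside RStudio. A defensive sketch (assuming the rstudioapi package is installed) could be:

```r
# Fall back gracefully when not running inside RStudio,
# since rstudioapi only works there.
if (requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()) {
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
} else {
  message("Not in RStudio; keeping the working directory: ", getwd())
}
```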
This time I had much less time to spend on this exercise, so I mostly used the methods given.
Data Set Information:
These data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades, demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided, one for each subject: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008] the two datasets were modelled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1, because G3 is the final-year grade (issued in the 3rd period) while G1 and G2 are the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for details).
# To empty the memory after the previous exercise
rm(list=ls())
# Load data
library(dplyr) # needed for glimpse() and select_if() below
#setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
alc <- read.csv(file = "data/ex3alc.csv")
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
#summary(alc)
# The data set is combined from two different files from the UCI Machine Learning Repository
#alc is the full dataset; we take the numeric variables for correlation analysis
alc2 <- select_if(alc, is.numeric)
correlations <- abs(cor(alc2, use="pairwise.complete.obs"))
#absolute values of all pairwise correlations
correlations.selected <- findCorrelation(correlations, cutoff = .3, exact = TRUE)
#columns flagged by the 0.3 cutoff
#Making sure that grades are part of the correlation table
grade.columns <- 15:17
correlations.selected <- union(correlations.selected,grade.columns )
# calculating the variables
corr <- cor(alc2[c(correlations.selected)])
#correlations of narrow selection and grades variables G1, G2, G3
# Plotting correlations
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of alc",
ggtheme=theme_bw)
# The variables with the highest correlations with G1:G3 are age, Medu, alc_use and Walc. Since alc_use and Walc correlate strongly with each other, only the better one is kept for the analysis. G3 is the final grade, so we first try to predict only that and drop G1 and G2, even though they would likely be good predictors of G3.
# So we want to predict the final grade of the students from their alcohol use, age and their mother's education level.
# I hypothesise that alcohol use reduces performance and that the mother's educational level increases it.
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=loess, fill="red", color="red", ...) +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
g = ggpairs(alc2,columns = c(correlations.selected), lower = list(continuous = my_fn))
g
# Let's add some factor variables to the data
names.selected <- colnames(alc2[correlations.selected])
alc3 <- subset(alc, select=c(names.selected, "sex", "school", "guardian", "activities"))
p <-ggpairs(alc3, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
# The categorical variables are quite similar and do not indicate much, other than that high alcohol use is more common among males and near-zero alcohol use is more common among females.
m <- glm(G3 ~ alc_use + Walc + age + Medu, family = "gaussian", data = alc3)
m2 <- glm(G3 ~ alc_use + age + Medu, family = "gaussian", data = alc3)
deviance(m2)/m2$null.deviance
## [1] 0.9052058
AIC(m,m2)
## df AIC
## m 6 1969.777
## m2 5 1969.543
par(mfrow=c(2,2))
plot(m2)
# The Akaike information criterion gives a similar result, so as hypothesised Walc is dropped, since it covaries with alc_use; that way we get the most parsimonious model.
# Note on the deviance ratio above: 0.905 is the share of variation left unexplained, so the model explains only about 9.5 % of the variation (R^2 = 1 - residual deviance / null deviance) without the help of G1 and G2.
# !!! Oh, I only realized at this point that the target variable was supposed to be high/low alcohol consumption; now I see why logistic regression is used in this case... !!!
# I suppose I need to do another parameter selection for this case..
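A side note on that deviance ratio: for a gaussian GLM the deviance equals the residual sum of squares, so 1 minus the deviance ratio is the familiar R-squared. A toy check (not the course data):

```r
# For a gaussian GLM, 1 - deviance/null.deviance equals lm's R-squared,
# so a ratio of 0.905 means only ~9.5 % of variation is explained.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 6.2)
g <- glm(y ~ x, family = "gaussian")
pseudo_r2 <- 1 - deviance(g) / g$null.deviance
all.equal(pseudo_r2, summary(lm(y ~ x))$r.squared)  # TRUE
```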
# Feature selection based on LASSO (least absolute shrinkage and selection operator), using centering and scaling as preprocessing, followed by 10-fold cross-validation
fit <- sbf(
form = alc$alc_use ~ .,
data = alc[c(2:27,29:31)], method = "glmnet", # Dalc and Walc are dropped as they are the parameters high_use is based on; G1:G3 dropped as well, since grades are known only after the students are done with their studies (especially the final exam G3)
tuneGrid=expand.grid(.alpha = .01, .lambda = .1),
preProc = c("center", "scale"),
trControl = trainControl(method = "none"),
sbfControl = sbfControl(functions = caretSBF, method = 'cv', number = 10)
)
fit
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance:
##
## RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 0.3502 0.8844 0.2429 0.04492 0.02063 0.01825
##
## Using the training set, 15 variables were selected:
## sexM, age, addressU, Fjobservices, reasonother...
##
## During resampling, the top 5 selected variables (out of a possible 19):
## absences (100%), freetime (100%), goout (100%), reasonother (100%), sexM (100%)
##
## On average, 13.7 variables were selected (min = 11, max = 16)
# Dalc is dropped as it is one of the parameters high_use is based on
bm <- glm(high_use ~ failures + absences + freetime + age, data = alc, family = "binomial")
bm2 <- glm(high_use ~ failures + absences + freetime, data = alc, family = "binomial")
bm3 <- glm(high_use ~ absences + age + G1+ G2+ G3, data = alc, family = "binomial") # here grades are included
bm4 <- glm(high_use ~ absences + age + G1, data = alc, family = "binomial") # here grades are included
bm5 <- glm(high_use ~ failures + absences + freetime + age + goout, data = alc, family = "binomial")
bm6 <- glm(high_use ~ failures + absences + goout, data = alc, family = "binomial")
# Stepwise forward and backward selection
step(bm6, direction = "both")
## Start: AIC=406.4
## high_use ~ failures + absences + goout
##
## Df Deviance AIC
## <none> 398.40 406.40
## - failures 1 402.50 408.50
## - absences 1 410.92 416.92
## - goout 1 440.14 446.14
##
## Call: glm(formula = high_use ~ failures + absences + goout, family = "binomial",
## data = alc)
##
## Coefficients:
## (Intercept) failures absences goout
## -3.65199 0.41507 0.07547 0.71311
##
## Degrees of Freedom: 381 Total (i.e. Null); 378 Residual
## Null Deviance: 465.7
## Residual Deviance: 398.4 AIC: 406.4
AIC(bm,bm2,bm3,bm4,bm5,bm6) # better fit found without grades
## df AIC
## bm 5 439.7969
## bm2 4 439.4485
## bm3 6 451.1008
## bm4 4 447.3286
## bm5 6 409.0303
## bm6 4 406.3965
# bm6 is the best fit: the parameters were preprocessed by centering and scaling, then selected with LASSO and 10-fold cross-validation, removing confounding parameters.
# The model was further validated with stepwise forward and backward selection and confirmed by checking the Akaike information criterion.
par(mfrow=c(2,2))
plot(bm6)
# Model validation is a bit trickier with binomial models; you won't see such neat plots unless the number of observations is very high (N > 400). In this case the plots look OK.
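One alternative I know of for binary outcomes (not from the course materials) is a binned residual plot: group the observations by fitted probability and check that the mean residual in each bin scatters around zero. A sketch on synthetic data:

```r
# Binned residual check for a logistic model (synthetic data).
set.seed(42)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = "binomial")
p   <- fitted(fit)

# split fitted probabilities into deciles and average the raw residuals
bin    <- cut(p, breaks = quantile(p, probs = seq(0, 1, 0.1)),
              include.lowest = TRUE)
binned <- tapply(residuals(fit, type = "response"), bin, mean)

plot(tapply(p, bin, mean), binned,
     xlab = "mean fitted probability", ylab = "mean residual")
abline(h = 0, lty = 2)  # well-calibrated bins hug this line
```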
# These are loaded only now, since they mask some functions from dplyr
library(MASS)
library(fitdistrplus)
alc_num <- as.numeric(alc$high_use)
fitBinom <- fitdist(data=alc_num, dist="binom", fix.arg=list(size=2), start=list(prob=0.3))
plot(pbinom(sort(alc_num), size=2, prob=fitBinom$estimate), ppoints(alc_num))
abline(0,1)
# This compares the fitted binomial distribution against the empirical distribution of the response variable.
# compute odds ratios (OR)
OR <- coef(bm6) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(bm6))
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02593937 0.0107041 0.05891273
## failures 1.51447465 1.0137625 2.27486120
## absences 1.07838996 1.0341084 1.12889344
## goout 2.04033617 1.6296451 2.58472531
# predictions
pred.bm6 <- predict(bm6, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = pred.bm6)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = pred.bm6 > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
alc[36:38] %>% tail(10)
## high_use probability prediction
## 373 FALSE 0.14055409 FALSE
## 374 TRUE 0.36139910 FALSE
## 375 FALSE 0.19198237 FALSE
## 376 FALSE 0.25734107 FALSE
## 377 FALSE 0.15979460 FALSE
## 378 FALSE 0.34330586 FALSE
## 379 FALSE 0.22362093 FALSE
## 380 FALSE 0.06224146 FALSE
## 381 TRUE 0.55365666 TRUE
## 382 TRUE 0.05797935 FALSE
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = c(high_use), y = probability, color = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64397906 0.05759162 0.70157068
## TRUE 0.18062827 0.11780105 0.29842932
## Sum 0.82460733 0.17539267 1.00000000
# how many are correct
sum(alc[36]==alc[38])/382
## [1] 0.7617801
# 76.178 % are correct
1-sum(alc[36]==alc[38])/382
## [1] 0.2382199
# Training error is 23.821 % (DataCamp is 26 %)
# Here is the DataCamp version of training error, (which I find more complicated, but perhaps easier to follow).
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199
# The model is better than the one introduced in DataCamp
# K-fold cross-validation
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
##
## aml
## The following object is masked from 'package:lattice':
##
## melanoma
cv <- cv.glm(data = alc, cost = loss_func, glmfit = bm6, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486911
Training error is 23.8 % (not cross-validated; DataCamp gets about 26 %). The cross-validated training error is 24.9 %.
# I found this very useful in the DataCamp exercises
# produce summary statistics by group with the pipe operator %>%
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade=mean(G3))
This week's exercise was interesting. In the end I tried some new methods I hadn't used before, such as LASSO feature selection. I wonder how the parameters were selected in the DataCamp exercises; was the method in any way similar to the LASSO approach I used? I also tested LASSO without the preprocessing and the results differed: that model performed much worse than the one with preprocessing. I'm curious how preprocessing can affect the LASSO results so much.
I completed all the DataCamp exercises before proceeding to this exercise. I also looked up some details on how to use R Markdown more productively, and hopefully this report will be clearer than the previous ones.
Today we are looking at the Boston data and its crime rate.
| Variables | Description |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
require("easypackages") # for loading and installing many packages at one time
packages_to_load <- c("broom", "dplyr", "MASS", "tidyverse", "corrplot", "ggplot2", "GGally", "caret", "devtools", "ggthemes", "scales", "plotly")
packages(packages_to_load, prompt = TRUE) # lock'n'load install/load
#Additionally
#install_github("fawda123/ggord") # Installed from Github for vizualization
library(ggord)
# To empty the memory after the excercise before this
# rm(list=ls())
# Load data
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # to set working directory to source file
# load the data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
All variables are continuous except chas (a dummy variable) and rad (an index), which are stored as integers. The data contain demographic information on Boston, including tax rates and other features possibly linked to crime rates within the city.
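A quick sketch to verify those storage types on the same MASS::Boston data:

```r
# Check which Boston columns are stored as integers vs doubles
library(MASS)
data("Boston")
table(sapply(Boston, class))             # counts of each storage type
sapply(Boston, class)[c("chas", "rad")]  # both "integer"
```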
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=loess, fill="red", color="red", ...) +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
g = ggpairs(Boston,columns = c(1:14), lower = list(continuous = my_fn))
g
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
A more focused figure of the correlations, since they are hardly visible with this many variables.
# Scaling the variables with mean and standard deviation with scale()
Boston_scaled <- as.data.frame(scale(Boston))
# create a quantile vector of crime and print it
bins <- quantile(Boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime' from the crime-rate quantiles
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest= TRUE, label = c("low","med_low","med_high", "high"))
# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)
# number of rows in the Boston dataset
n <-nrow(Boston)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- Boston_scaled[ind,]
# create test set
test <- Boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <-test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
summary(train)
## zn indus chas
## Min. :-0.4872 Min. :-1.51549 Min. :-0.272329
## 1st Qu.:-0.4872 1st Qu.:-0.86902 1st Qu.:-0.272329
## Median :-0.4872 Median :-0.18028 Median :-0.272329
## Mean : 0.0033 Mean : 0.03352 Mean :-0.009206
## 3rd Qu.:-0.3532 3rd Qu.: 1.01499 3rd Qu.:-0.272329
## Max. : 3.8005 Max. : 2.42017 Max. : 3.664771
## nox rm age
## Min. :-1.4644 Min. :-3.44659 Min. :-2.33313
## 1st Qu.:-0.8452 1st Qu.:-0.58052 1st Qu.:-0.78067
## Median :-0.1441 Median :-0.12401 Median : 0.38812
## Mean : 0.0514 Mean :-0.01551 Mean : 0.04799
## 3rd Qu.: 0.7966 3rd Qu.: 0.49866 3rd Qu.: 0.91834
## Max. : 2.7296 Max. : 3.47325 Max. : 1.11639
## dis rad tax
## Min. :-1.26582 Min. :-0.98187 Min. :-1.31269
## 1st Qu.:-0.83538 1st Qu.:-0.63733 1st Qu.:-0.75495
## Median :-0.33209 Median :-0.52248 Median :-0.42268
## Mean :-0.03155 Mean : 0.05459 Mean : 0.06195
## 3rd Qu.: 0.61991 3rd Qu.: 1.65960 3rd Qu.: 1.52941
## Max. : 3.95660 Max. : 1.65960 Max. : 1.79642
## ptratio black lstat medv
## Min. :-2.70470 Min. :-3.9033 Min. :-1.52961 Min. :-1.9063
## 1st Qu.:-0.48756 1st Qu.: 0.2017 1st Qu.:-0.74682 1st Qu.:-0.6913
## Median : 0.25149 Median : 0.3837 Median :-0.13696 Median :-0.1721
## Mean : 0.01459 Mean :-0.0200 Mean : 0.04816 Mean :-0.0273
## 3rd Qu.: 0.80578 3rd Qu.: 0.4331 3rd Qu.: 0.63603 3rd Qu.: 0.2683
## Max. : 1.63721 Max. : 0.4406 Max. : 3.54526 Max. : 2.9865
## crime
## low : 98
## med_low : 89
## med_high:105
## high :112
##
##
Now we have generated the train and test datasets with the new categorical variable "crime", which is based on the quantiles of the original numeric variable.
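The quantile-binning step can be seen on a toy vector (the same cut() pattern as above):

```r
# cut() with quantile() breaks: each class gets ~25 % of the observations
v    <- c(1, 3, 5, 7, 9, 11, 13, 15)
bins <- quantile(v)
cats <- cut(v, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # two observations per class
```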
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2202970 0.2599010 0.2772277
##
## Group means:
## zn indus chas nox rm
## low 1.05611797 -0.9539497 -0.15180559 -0.8847310 0.46084014
## med_low -0.08617034 -0.2158360 -0.00690658 -0.5553086 -0.09722015
## med_high -0.38025152 0.1175536 0.17762524 0.3765069 0.02939201
## high -0.48724019 1.0169208 -0.06141298 1.0478378 -0.40947736
## age dis rad tax ptratio
## low -0.8542196 0.9609043 -0.7029578 -0.7323671 -0.43618146
## med_low -0.3488940 0.3223541 -0.5611968 -0.4396129 -0.08896864
## med_high 0.3973215 -0.3679215 -0.4076377 -0.3219809 -0.29663576
## high 0.8253033 -0.8658225 1.6401200 1.5154800 0.78309558
## black lstat medv
## low 0.37524015 -0.77999636 0.524660469
## med_low 0.34304712 -0.11341692 -0.008331595
## med_high 0.09939918 0.03604687 0.143270083
## high -0.76627702 0.91253805 -0.685264527
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.131659777 0.708316409 -0.78209681
## indus -0.050784099 -0.313038607 0.64354579
## chas -0.008044673 -0.051096072 0.10481470
## nox 0.418782526 -0.541165008 -1.47846665
## rm 0.012765542 -0.001436898 0.04122097
## age 0.303432313 -0.230379508 -0.40368805
## dis -0.104954903 -0.068615211 0.02136377
## rad 3.192497180 0.666766947 -0.04642300
## tax -0.014156153 0.236978820 0.50168061
## ptratio 0.148347587 0.113054637 -0.33483539
## black -0.114889332 -0.001692874 0.16208788
## lstat 0.155073861 -0.202889925 0.40692745
## medv 0.059628404 -0.310170031 -0.30539105
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9524 0.0346 0.0130
g <- ggord(lda.fit, train$crime, ellipse_pro = 0.95, vec_ext = 2, alpha_el = 0.3, size = 2)
g + theme_tufte() + geom_rangeframe()
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 11 0 0
## med_low 7 15 15 0
## med_high 0 3 17 1
## high 0 0 0 15
In the high-crime quartile we predicted every case correctly, but the other classes didn't go as well. Overall, though, LDA seems to predict our classes with quite good accuracy.
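The overall accuracy behind that statement is simply the diagonal share of the cross-tabulation; a toy sketch of the same computation:

```r
# Accuracy from a confusion table: correct predictions sit on the diagonal
correct   <- factor(c("low", "low", "high", "high", "med"))
predicted <- factor(c("low", "med", "high", "high", "med"))
tab <- table(correct, predicted)
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 0.8
```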
# k-means clustering
BostonS <- scale(Boston) # standardize variables
wss <- (nrow(BostonS)-1)*sum(apply(BostonS,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(BostonS,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
km <-kmeans(BostonS, centers = 4)
# plot the Boston dataset with clusters
pairs(BostonS, col = km$cluster)
We see similar results to those from the LDA.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Now we can see the 3D picture of the clusters!
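The 3D scatter is easier to read when the points carry class information; a hedged sketch, assuming the plotly package and the `train$crime` factor used when fitting the LDA:

```r
# same projection as above, but coloured by the crime rate class
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train$crime)
```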
I completed all the DataCamp exercises prior to proceeding to this exercise. Today we are foraying into the mystical world of GNI, and then it's time for tea.
I found this tip for making the R Markdown output neater. It is quite similar to the solution I used: I stored the CSS in a separate file, "styles.css", which knitr loads. Look at the index file if interested.
First, the GNI data.
| Variables | Description |
|---|---|
| Edu2.FM | Ratio of females to males with at least secondary education |
| Labo.FM | Ratio of female to male labour force participation rates |
| Life.Exp | Life expectancy at birth |
| Edu.Exp | Expected years of schooling |
| GNI | Gross National Income per capita |
| Mat.Mor | Maternal mortality ratio |
| Ado.Birth | Adolescent birth rate |
| Parli.F | Percentage of female representatives in parliament |
require("easypackages") # for loading and installing many packages at one time
packages_to_load <- c("broom", "dplyr", "tidyverse", "corrplot", "ggplot2", "GGally", "devtools", "ggthemes", "stringr", "FactoMineR", "factoextra")
packages(packages_to_load, prompt = TRUE) # lock'n'load install/load
human <- read.table("data/humaf.csv", sep =",", header = T, row.names = 1)
glimpse(human)
## Observations: 155
## Variables: 8
## $ Edu2.FM <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.969060...
## $ Labo.FM <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.828611...
## $ Life.Exp <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0...
## $ Edu.Exp <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9...
## $ GNI <int> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 529...
## $ Mat.Mor <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 1...
## $ Ado.Birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3...
## $ Parli.F <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2...
ggpairs(
  human, 1:8,
  lower = list(combo = wrap("facethist", binwidth = 0.5))
)
ggduo(
human, 5, c(1:4,6:8),
types = list(continuous = "smooth_loess", mapping = aes(color = GNI)),
title = "GNI constituents loess smooth",
xlab = "",
ylab = ""
)
human_std <- scale(human) %>% as.data.frame()
ggpairs(
  human_std, 1:8,
  lower = list(combo = wrap("facethist", binwidth = 0.5))
)
ggduo(
human_std, 5, c(1:4,6:8),
types = list(continuous = "smooth_loess", mapping = aes(color = GNI)),
title = "GNI constituents loess smooth",
xlab = "",
ylab = ""
)
The scaled dataset looks very similar; however, the extreme values are now pulled closer to the rest of the data.
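That pull-in is exactly what `scale()` does: each column is centred to mean 0 and rescaled to standard deviation 1. A self-contained check, using the built-in mtcars data as a stand-in:

```r
# scale() standardizes every column of a numeric data frame
std <- as.data.frame(scale(mtcars))
all(abs(colMeans(std)) < 1e-10)        # TRUE: every column mean is (numerically) 0
all(abs(sapply(std, sd) - 1) < 1e-10)  # TRUE: every column sd is 1
```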
# Running the principal component analysis
pca_human <- prcomp(human)
pca_human_std <- prcomp(human_std)
# comparison summaries
s_pca_human <- summary(pca_human)
s_pca_human_std <- summary(pca_human_std)
s_pca_human
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
s_pca_human_std
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# Generating labels
pca_pr <- round(100*s_pca_human$importance[2, ], digits = 1)
pca_pr_std <- round(100*s_pca_human_std$importance[2, ], digits = 1)
pca_pr_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
pca_pr_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")
#Plotting 2
biplot(
pca_human, cex = c(0.8, 1),
col = c("grey40", "deeppink2"),
xlab = pca_pr_lab[1],
ylab = pca_pr_lab[2],
main= "Principal component analysis of GNI components (unstandardized)"
)
biplot(
pca_human_std,
cex = c(0.8, 1),
col = c("black", "green2"),
xlab = pca_pr_lab_std[1],
ylab = pca_pr_lab_std[2],
main= "Principal component analysis of GNI components (standardized)"
)
The unscaled results are unusable: nearly all of the variance is loaded on the first component, because the uneven scales give the variables with large absolute values (GNI in particular) overwhelming weight.
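The effect is easy to reproduce with synthetic data: two uncorrelated variables on wildly different scales, where the large-scale one swallows PC1 unless the analysis is standardized (the variable names and seed below are made up for illustration):

```r
set.seed(1)
x <- data.frame(small = rnorm(100),            # sd 1
                big   = rnorm(100, sd = 1e4))  # sd 10000
p_raw <- prcomp(x)                   # unscaled PCA
p_std <- prcomp(x, scale. = TRUE)    # standardized PCA
summary(p_raw)$importance[2, ]  # PC1 explains essentially all of the variance
summary(p_std)$importance[2, ]  # roughly an even split between PC1 and PC2
```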
| Variables | Description | PC1 | PC2 |
|---|---|---|---|
| Edu2.FM | Ratio of females to males with at least secondary education | -0.35664370 | 0.03796058 |
| Labo.FM | Ratio of female to male labour force participation rates | 0.05457785 | 0.72432726 |
| Life.Exp | Life expectancy at birth | -0.44372240 | -0.02530473 |
| Edu.Exp | Expected years of schooling | -0.42766720 | 0.13940571 |
| GNI | Gross National Income per capita | -0.35048295 | 0.05060876 |
| Mat.Mor | Maternal mortality ratio | 0.43697098 | 0.14508727 |
| Ado.Birth | Adolescent birth rate | 0.41126010 | 0.07708468 |
| Parli.F | Percentage of female representatives in parliament | -0.08438558 | 0.65136866 |
PCA decomposes the data into eight components, of which the first three explain 79 % of the variance and the first two about 70 %. The first (horizontal) component is driven on the negative side by education, life expectancy and GNI, and on the positive side by maternal mortality and the adolescent birth rate. The second component is driven by parliamentary representation of females and the female labour participation ratio.
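The loadings in the table above come straight from the rotation matrix of the fitted object; a sketch, assuming the `pca_human_std` and `s_pca_human_std` objects from the chunk above:

```r
# loadings (rotation) of the first two standardized components
round(pca_human_std$rotation[, 1:2], 3)
# cumulative percentage of variance explained by each component
round(100 * s_pca_human_std$importance[3, ], 1)
```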
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
tea_time %>% summary()
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
#str(tea_time)
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
plot(mca, habillage = "quali")
# Scree plot: percentage of variance explained by each dimension
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 50))
# Individuals grouped by 'how', the strongest variable on dimension 1
fviz_mca_ind(mca, habillage = "how",
             addEllipses = TRUE, repel = TRUE)
# Individuals grouped by 'where', the strongest variable on dimension 2
fviz_mca_ind(mca, habillage = "where",
             addEllipses = TRUE, repel = TRUE)
fviz_mca_var(mca, repel = TRUE)
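factoextra can also rank the variable categories by their contribution to a dimension, which makes the "strongest variable" choices above easy to check directly; a sketch, assuming the `mca` object fitted above:

```r
# bar plots of category contributions to MCA dimensions 1 and 2
fviz_contrib(mca, choice = "var", axes = 1, top = 10)
fviz_contrib(mca, choice = "var", axes = 2, top = 10)
```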